Chapter 3 Exploratory Data Analysis

3.1 Start with dplyr counts and summaries in console

  • David Robinson often starts exploring data with simple counts in the console.

  • Here we don’t prefix functions with the package name (breaking the rule I just gave you) so that we can explore the data quickly with dplyr verbs.
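As a minimal sketch of this kind of console exploration, the code below counts rows per city and then summarises sales by city and year. It assumes the data is ggplot2::txhousing, which has the city, date and sales columns mentioned in this chapter; the names housing, city_counts and yearly_sales are made up for illustration.

```r
library(dplyr)

# Assumption: the chapter's data is ggplot2::txhousing (not named in the text).
housing <- ggplot2::txhousing

# How many rows does each city have? Sort so the largest counts come first.
city_counts <- housing %>%
  count(city, sort = TRUE)
city_counts

# Total sales per city and year, ignoring missing values.
yearly_sales <- housing %>%
  group_by(city, year) %>%
  summarise(total_sales = sum(sales, na.rm = TRUE)) %>%
  ungroup()
yearly_sales
```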

3.2 Plot data points with geom_point()

  • After using dplyr count(), group_by() and summarise(), try plotting all data points with ggplot2::geom_point(). It almost NEVER fails to show you what’s going on quickly and is unlikely to return errors.

  • ggplot2::geom_point() is the minimum and most reliable code to start with.

  • Let’s look at all the values of sales for each date.

## Warning: Removed 568 rows containing missing values (geom_point).
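A plot like the one described above could be produced roughly as follows. This is a sketch that again assumes the data is ggplot2::txhousing; the object names are hypothetical. Rows with a missing sales value are dropped with a warning when the plot is drawn.

```r
library(ggplot2)

# Assumption: the chapter's data is ggplot2::txhousing.
housing <- ggplot2::txhousing

# Every sales value plotted against its date, one dot per row.
p_points <- ggplot(housing, aes(x = date, y = sales)) +
  geom_point()
p_points
```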

  • Now let’s look at the individual sales values over the values of other columns like city.
## Warning: Removed 568 rows containing missing values (geom_point).

  • These very dark lines, where we can’t see separate data points, are known as overplotting. We can solve this by replacing geom_point() with geom_jitter(), which randomly “jitters” the points a bit so we can see more of them individually.

  • Sometimes there are so many data points that jitter alone is not enough to reduce overplotting. We can also make the dots lighter using a parameter called alpha. The lower the value, the fainter the points.

## Warning: Removed 568 rows containing missing values (geom_point).
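The two overplotting fixes above can be combined in one layer. The sketch below assumes the ggplot2::txhousing data and picks arbitrary values for alpha and the jitter width; tune both to taste.

```r
library(ggplot2)

# Assumption: the chapter's data is ggplot2::txhousing.
housing <- ggplot2::txhousing

# Sales values for each city. Points are jittered sideways only
# (height = 0 leaves the sales values untouched) and made faint with alpha.
p_jitter <- ggplot(housing, aes(x = city, y = sales)) +
  geom_jitter(width = 0.3, height = 0, alpha = 0.2)
p_jitter
```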

Hadley Wickham has a few more tricks to solve overplotting in the overplotting chapter of his ggplot2 book.

  • We all know sales of most things vary by the time of the year. Let’s now put date on the x axis, make city the colour, and because the data is over time we can join the data points using ggplot2::geom_line().

  • We’re also using the reduced data set with fewer cities so the plot is less crowded. As a rule of thumb, more than about 7 lines makes for a confusing plot.

## Warning: Removed 1 rows containing missing values (geom_path).
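A line plot of this kind might look like the sketch below. The reduced data set is not shown in the text, so the five cities chosen here are a hypothetical selection from ggplot2::txhousing, which is itself an assumption about the data.

```r
library(dplyr)
library(ggplot2)

# Hypothetical reduced data set: five cities picked for illustration only.
few_cities <- c("Austin", "Dallas", "Houston", "San Antonio", "El Paso")
housing_small <- ggplot2::txhousing %>%
  filter(city %in% few_cities)

# Date on the x axis, one coloured line per city.
p_lines <- ggplot(housing_small, aes(x = date, y = sales, colour = city)) +
  geom_line()
p_lines
```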

  • Beautiful. While sales have very different volumes in different cities, we can see they follow the same seasonal pattern. To bring the sales patterns closer together and make them easier to compare, we can transform the sales value by taking its log.

  • This is what Hadley Wickham does in ggplot2: Elegant Graphics for Data Analysis, though this transformation can only be seen in the paid-for version of his book.

  • He goes on to model this data, removing the strong seasonal effects by fitting a linear model between the log of sales and the month, then plotting the residuals (i.e. the change in sales not explained by the month). This is beyond the scope of this book, though interestingly we get at it in the animation.

## Warning: Removed 1 rows containing missing values (geom_path).
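The log transformation can be applied directly inside aes(). This sketch reuses the same hypothetical five-city subset of ggplot2::txhousing as an assumed stand-in for the reduced data set.

```r
library(dplyr)
library(ggplot2)

# Hypothetical reduced data set: five cities picked for illustration only.
few_cities <- c("Austin", "Dallas", "Houston", "San Antonio", "El Paso")
housing_small <- ggplot2::txhousing %>%
  filter(city %in% few_cities)

# Taking the log of sales pulls the lines closer together,
# so the shared seasonal pattern is easier to compare.
p_log <- ggplot(housing_small, aes(x = date, y = log(sales), colour = city)) +
  geom_line()
p_log
```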

3.3 Facet by categories

  • Another logical step after showing categories by colour is to use “small multiples”. This is a fancy way of saying: draw a chart for each value of one or more columns and look at them all at once, usually in a grid. An important setting here is scales = "free", which gives each small plot its own scale. This lets us more easily spot interesting differences in the seasonal pattern between cities.
## Warning: Removed 1 rows containing missing values (geom_path).
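One way to draw such small multiples is ggplot2::facet_wrap(). The sketch below assumes the same hypothetical five-city subset of ggplot2::txhousing used for illustration earlier.

```r
library(dplyr)
library(ggplot2)

# Hypothetical reduced data set: five cities picked for illustration only.
few_cities <- c("Austin", "Dallas", "Houston", "San Antonio", "El Paso")
housing_small <- ggplot2::txhousing %>%
  filter(city %in% few_cities)

# One panel per city; scales = "free" lets each panel choose its own axes.
p_facets <- ggplot(housing_small, aes(x = date, y = sales)) +
  geom_line() +
  facet_wrap(~ city, scales = "free")
p_facets
```

With free scales the panels are no longer comparable in absolute size, only in shape, which is the point when looking for differences in the seasonal pattern.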

3.4 Facet interactively (trelliscopejs)

  • An interactive way to facet (or create small multiples) and explore your data with a GUI in R is trelliscopejs. Here we look at all the US cities faceted by city in a trelliscope web page. Have a play with all the settings and see what they do.

3.5 Loop to plot every category separately

  • To study each small plot in a larger size on its own quickly we can loop through every city and plot a full chart for each one.

  • To do this we nest a dataframe for each city into one dataframe. Then loop through each nested dataframe creating a plot for each one.

  • First we group, then “nest” by city using tidyr::nest().

  • Let’s look at what a nested dataframe looks like.
  • If we wanted, we could view one of the nested data frames using square brackets. Think of the numbers in the square brackets like the co-ordinates in Excel. The first number is the column position and the second number is the row position.
  • We can now add a plot to each nested data frame. We use purrr::map2(), a compact way to loop over two sets of values in parallel and pass each pair to one function. In this case the values are the data set inside each row and the value of the city column.
  • Take a look at the new nested data frame.
  • And let’s look at the information held for one of the plots, again using values in square brackets. You’ll see that it’s a series of nested lists that describe every element of the plot.
  • Finally, let’s print every plot quite simply like this.
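
The steps above can be sketched end to end as follows. This assumes the data is ggplot2::txhousing and nests every city in it; the names by_city and plot are hypothetical, and the plot built for each city is a plain line chart chosen for illustration.

```r
library(dplyr)
library(tidyr)
library(purrr)
library(ggplot2)

# Group, then nest: one row per city, with that city's rows
# stored as a data frame in the list-column `data`.
by_city <- ggplot2::txhousing %>%
  group_by(city) %>%
  nest() %>%
  ungroup()

by_city            # the nested data frame
by_city[[2]][[1]]  # column 2 (`data`), row 1: one nested data frame

# map2() walks over `data` and `city` together:
# .x is the nested data frame, .y is the city name used as the title.
by_city <- by_city %>%
  mutate(plot = map2(data, city, ~ ggplot(.x, aes(x = date, y = sales)) +
                       geom_line() +
                       ggtitle(.y)))

by_city$plot[[1]]  # draw the plot for the first city
# by_city$plot     # printing the whole list-column draws every plot in turn
```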

[Output: 14 plots printed in sequence as list elements [[1]] to [[14]]; figures not shown. Element [[5]] gives the warning "Removed 1 rows containing missing values (geom_path)."]

3.6 Polish your final plot

  • We now have a bare minimum Exploratory Data Analysis toolkit of how to explore the data from the console using View(), and then looking at the data points, followed by some line plots.

  • We could soon be ready to decide on the plot we want, the one that tells an interesting story. But adding in all the bells and whistles to make it ready for a customer or a publication can take ages. It shouldn’t be part of your exploratory data analysis.

  • Also, we should use the code style recommended earlier, which lays out your code cleanly. It’s then far quicker to comment out or tweak the values of each part of your plot until it looks just right.

  • So this isn’t necessarily the final perfect plot. There are things you may want to change depending on what story you want to tell or your personal style. But with this clear ladder of code you can more quickly read, edit, comment chunks out, or run in chunks from the top down (to understand what each bit does, like the popular ggplot flip-books).

  • How did I create it? By Googling for what I wanted to do (e.g. “ggplot remove axis grid lines”), copying the code from a Stack Overflow answer, and putting it into a clear structure as below.

  • Most of the tweaks or polish you do will use ggplot2::theme() or ggplot2::scale… But are you really going to remember what to do each time? I try not to worry about remembering how to do it and just focus on how I want it to look, getting absorbed in the look.

  • Then, after you have built a few of your own charts with clear code, you will end up going back to your own plots as a store of code chunks you can re-use, because your code is so well structured. But be prepared for this tweaking to take you much longer than you planned. Always.

## label_key: city
## Saving 7 x 5 in image
## Warning: package 'gdtools' was built under R version 3.6.1
## Warning: Removed 430 rows containing missing values (geom_path).
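
To illustrate the “ladder of code” layout described in this section, here is a sketch of a polished plot with one step per line, so that any step can be commented out or tweaked on its own. It is not the plot that produced the output above: the five-city subset of ggplot2::txhousing, the title and labels, and every styling choice are assumptions made for illustration.

```r
library(dplyr)
library(ggplot2)

# Hypothetical reduced data set: five cities picked for illustration only.
few_cities <- c("Austin", "Dallas", "Houston", "San Antonio", "El Paso")
housing_small <- ggplot2::txhousing %>%
  filter(city %in% few_cities)

p_final <- ggplot(housing_small, aes(x = date, y = sales, colour = city)) +
  # geometry: silently drop rows with missing sales
  geom_line(na.rm = TRUE) +
  # scales: log axis with readable, comma-separated labels
  scale_y_log10(labels = scales::comma) +
  # labels (illustrative wording)
  labs(
    title = "Monthly home sales in five Texas cities",
    x = NULL,
    y = "Sales (log scale)",
    colour = "City"
  ) +
  # theme: start from a light preset, then tweak individual elements
  theme_minimal() +
  theme(
    panel.grid.minor = element_blank(),
    legend.position = "bottom"
  )
p_final

# Save when happy with the look (the file name is a placeholder).
# ggsave("final_plot.png", p_final, width = 7, height = 5)
```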